Hi Naresh - I saved two crucial artifacts as part of FastText training in gensim, and the same techniques should apply to word2vec training in gensim as well.
First, I saved the word embeddings in the w2v text file format.
...
# save w2v format since this is useful for PyTorch
if save_w2v:
    w2v_out_filepath = os.path.join(save_dir, f'{file_name}_w2vformat.txt')
    model.wv.save_word2vec_format(w2v_out_filepath)
    print(f'Saved {w2v_out_filepath}')
...
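(For reference, the saved file is plain text: a header line with the vocabulary size and vector dimensionality, followed by one line per word containing the word and its space-separated vector values.)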
Second, I saved the word frequency Counter dictionary that torchtext.vocab.Vocab requires.
...
# save word frequencies since this is useful for PyTorch
# (model.wv.vocab is the gensim 3.x API: a dict mapping each word to a
#  Vocab object that carries its corpus count)
counts = Counter(
    {word: vocab.count
     for (word, vocab) in model.wv.vocab.items()})
freq_filepath = os.path.join(save_dir, f'{file_name}_word_freq.json')
save_json(freq_filepath, counts)
print(f'Saved {freq_filepath}')
...
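(save_json above is just a thin helper; something along these lines is all it needs to do, since Counter is a dict subclass and json.dump can serialize it directly.)

import json

def save_json(filepath, obj):
    # Counter is a dict subclass, so json.dump writes it as a plain JSON object
    with open(filepath, 'w') as f:
        json.dump(obj, f)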
Then I load these files into torchtext.vocab.Vectors and torchtext.vocab.Vocab objects, respectively, as described above, and it all comes together beautifully.
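In case the loading step is useful too, here's a rough sketch of how that side can look, assuming the legacy torchtext API (the pre-0.9 torchtext.vocab.Vectors/Vocab) and the save_dir/file_name variables from the snippets above. Note that Vectors logs a warning when it skips the w2v header line, which is expected.

import json
import os
from collections import Counter

from torchtext.vocab import Vectors, Vocab

# load the saved embeddings (the w2v header line is skipped with a warning)
vectors = Vectors(name=os.path.join(save_dir, f'{file_name}_w2vformat.txt'),
                  cache=save_dir)

# rebuild the word frequency Counter saved earlier
with open(os.path.join(save_dir, f'{file_name}_word_freq.json')) as f:
    counts = Counter(json.load(f))

# Vocab ties the frequencies and the pretrained vectors together;
# vocab.vectors ends up aligned with the vocab's indices
vocab = Vocab(counts, vectors=vectors)

From there, vocab.vectors can go straight into torch.nn.Embedding.from_pretrained.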
I hope that helps!